Multi-Task Deep Neural Networks for Natural Language Understanding

2019-06-17

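This paper combines language model pre-training with multi-task learning to learn text representations. It achieves state-of-the-art results on several public datasets and also generalizes well in domain adaptation experiments. ACL 2019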

Introduction

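Learning general-purpose text representations underlies many NLP tasks. Two widely used approaches are multi-task learning (MTL) and language model pre-training. The core idea of this paper is to combine the two: the authors argue that MTL and language model pre-training are complementary, so they build on Liu et al. (2015), use BERT as the shared encoder, and propose MT-DNN (Multi-Task Deep Neural Networks).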

The Proposed MT-DNN Model

The MT-DNN architecture is shown in the figure below:

Figure 1: Architecture of the MT-DNN model for representation learning. The lower layers are shared across all tasks while the top layers are task-specific. The input X (either a sentence or a pair of sentences) is first represented as a sequence of embedding vectors, one for each word, in $l_{1}$. Then the Transformer encoder captures the contextual information for each word and generates the shared contextual embedding vectors in $l_{2}$. Finally, for each task, additional task-specific layers generate task-specific representations, followed by operations necessary for classification, similarity scoring, or relevance ranking.

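Lexicon Encoder ($l_{1}$): The input is $X=\{x_{1},\ldots,x_{m}\}$, where m is the number of input tokens. Following Devlin et al. (2018), the first token is always [CLS]. If the input is a sentence pair $(X_{1},X_{2})$, a [SEP] token is inserted between the two sentences. The lexicon encoder sums the word, segment, and positional embeddings to obtain the representation of each token.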
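The input construction described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the function names `build_input` and `embed` are hypothetical, the embedding tables are random toy values, and assigning [SEP] to segment 0 is an assumption borrowed from common BERT practice.

```python
import random

def build_input(sent_a, sent_b=None):
    """Return (tokens, segment_ids): [CLS] first, [SEP] between a sentence pair."""
    tokens = ["[CLS]"] + list(sent_a)
    segment_ids = [0] * len(tokens)
    if sent_b is not None:
        tokens.append("[SEP]")
        segment_ids.append(0)  # assumption: [SEP] belongs to the first segment
        tokens += list(sent_b)
        segment_ids += [1] * len(sent_b)
    return tokens, segment_ids

def embed(tokens, segment_ids, dim=4, seed=0):
    """Sum toy word, segment, and positional embeddings for each token."""
    rng = random.Random(seed)

    def rand_vec():
        return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

    word_table = {}                                   # filled lazily, one vector per word
    segment_table = [rand_vec(), rand_vec()]          # segment 0 and segment 1
    position_table = [rand_vec() for _ in tokens]     # one vector per position
    vectors = []
    for pos, (tok, seg) in enumerate(zip(tokens, segment_ids)):
        if tok not in word_table:
            word_table[tok] = rand_vec()
        summed = [w + s + p for w, s, p in
                  zip(word_table[tok], segment_table[seg], position_table[pos])]
        vectors.append(summed)
    return vectors

tokens, segment_ids = build_input(["a", "man", "sleeps"], ["a", "person", "rests"])
vectors = embed(tokens, segment_ids)
```

Here `vectors` holds one summed embedding per token, which is what the shared Transformer encoder would consume in $l_{2}$.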

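Transformer Encoder ($l_{2}$): The paper uses a multi-layer bidirectional Transformer encoder as the shared feature representation, producing a contextual representation for each token. Unlike BERT, MT-DNN learns these contextual representations through multi-task learning in addition to language model pre-training.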

Task Specific Layers: The task-specific layers differ from task to task. The paper uses four types of GLUE tasks as examples:

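Single-Sentence Classification Output: The contextual representation $\mathbf{x}$ of the [CLS] token is used directly as the sentence representation and passed through a softmax classification layer, shown here for SST: $P_{r}(c | X)=\operatorname{softmax}\left(\mathbf{W}_{S S T}^{\top} \cdot \mathbf{x}\right)$
Text Similarity Output: The contextual representation $\mathbf{x}$ of the [CLS] token is used directly as the representation of the sentence pair $(X_{1},X_{2})$, and a feed-forward layer computes the similarity score, shown here for STS: $\operatorname{sim}\left(X_{1}, X_{2}\right)=\mathbf{w}_{S T S}^{\top} \cdot \mathbf{x}$
Pairwise Text Classification Output: Taking natural language inference (NLI) as the example, given a premise $P=(p_{1},\ldots,p_{m})$ and a hypothesis $H=(h_{1},\ldots,h_{n})$, the goal is to predict the logical relationship between them. The authors adopt an approach similar to the stochastic answer network (SAN) (Liu et al., 2018a). The Transformer encoder first produces the contextual representations of the premise P and the hypothesis H, $\mathbf{M}^{p} \in \mathbb{R}^{d \times m}$ and $\mathbf{M}^{h} \in \mathbb{R}^{d \times n}$, and K steps of reasoning are then performed over them, where K is a hyperparameter. The reasoning proceeds as follows. The initial state is $\mathbf{s}^{0}=\sum_{j} \alpha_{j} \mathbf{M}_{j}^{h}$, with $\alpha_{j}=\frac{\exp \left(\mathbf{w}_{1}^{\top} \cdot \mathbf{M}_{j}^{h}\right)}{\sum_{i} \exp \left(\mathbf{w}_{1}^{\top} \cdot \mathbf{M}_{i}^{h}\right)}$. At step $k \in \{1, 2, \ldots, K-1\}$, the state is $\mathbf{s}^{k}=\mathrm{GRU}\left(\mathbf{s}^{k-1}, \mathbf{x}^{k}\right)$, where $\mathbf{x}^{k}$ is computed from the previous state $\mathbf{s}^{k-1}$ and the memory $\mathbf{M}^{p}$: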
$$
\mathbf{x}^{k}=\sum_{j} \beta_{j} \mathbf{M}_{j}^{p}\\
\beta_{j}=\operatorname{softmax}\left(\mathbf{s}^{k-1} \mathbf{W}_{2}^{\top} \mathbf{M}^{p}\right)
$$
The relation between the two is predicted at every step k, and the K output scores are averaged at the end:
$$
P_{r}^{k}=\operatorname{softmax}\left(\mathbf{W}_{3}^{\top}\left[\mathbf{s}^{k} ; \mathbf{x}^{k} ;\left|\mathbf{s}^{k}-\mathbf{x}^{k}\right| ; \mathbf{s}^{k} \cdot \mathbf{x}^{k}\right]\right)\\
P_{r}=\operatorname{avg}\left(\left[P_{r}^{0}, P_{r}^{1}, \ldots, P_{r}^{K-1}\right]\right)
$$
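Relevance Ranking Output: Taking QNLI as the example, given a question Q and a set of candidate answers, the goal is to rank the candidates by relevance. The contextual representation $\mathbf{x}$ of the [CLS] token is again used as the representation of the (Q, A) pair, and the relevance score is computed as $\operatorname{Rel}(Q, A)=g\left(\mathbf{w}_{Q N L I}^{\top} \cdot \mathbf{x}\right)$.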
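The K-step reasoning used for pairwise text classification is the most involved of the four output layers, so a small plain-Python sketch may help. It is an illustration under stated assumptions, not the authors' implementation: all parameters are random, the GRU cell is a standard bias-free formulation, the memories are stored as lists of column vectors, every function name is hypothetical, and $\mathbf{x}^{0}$ (which the description leaves undefined) is assumed to be obtained by attending over $\mathbf{M}^{p}$ with $\mathbf{s}^{0}$.

```python
import math
import random

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, v):
    return [dot(row, v) for row in W]

def weighted_sum(weights, columns):
    dim = len(columns[0])
    return [sum(w * col[i] for w, col in zip(weights, columns)) for i in range(dim)]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def initial_state(M_h, w1):
    # s^0 = sum_j alpha_j M^h_j, alpha = softmax_j(w1 . M^h_j)
    alpha = softmax([dot(w1, col) for col in M_h])
    return weighted_sum(alpha, M_h)

def attend(s_prev, M_p, W2):
    # beta = softmax(s^{k-1} W2^T M^p); note (s W2^T)^T equals W2 s
    query = matvec(W2, s_prev)
    beta = softmax([dot(query, col) for col in M_p])
    return weighted_sum(beta, M_p)

def gru(s_prev, x, p):
    # Standard GRU cell without biases (a simplification)
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], s_prev))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], s_prev))]
    gated = [ri * si for ri, si in zip(r, s_prev)]
    cand = [math.tanh(a + b) for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], gated))]
    return [(1 - zi) * si + zi * ci for zi, si, ci in zip(z, s_prev, cand)]

def classify(s, x, W3T):
    # Features [s; x; |s - x|; s * x], then a softmax over classes
    feats = s + x + [abs(a - b) for a, b in zip(s, x)] + [a * b for a, b in zip(s, x)]
    return softmax(matvec(W3T, feats))

def san_predict(M_p, M_h, p, K):
    s = initial_state(M_h, p["w1"])
    x = attend(s, M_p, p["W2"])  # assumption: x^0 is attended from s^0
    outputs = [classify(s, x, p["W3T"])]
    for _ in range(1, K):
        x = attend(s, M_p, p["W2"])  # x^k from s^{k-1}
        s = gru(s, x, p)             # s^k = GRU(s^{k-1}, x^k)
        outputs.append(classify(s, x, p["W3T"]))
    num_classes = len(outputs[0])
    return [sum(o[c] for o in outputs) / len(outputs) for c in range(num_classes)]

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, m, n, num_classes, K = 4, 5, 3, 3, 3  # toy sizes, chosen only for illustration
M_p = random_matrix(rng, m, d)  # m premise columns, each of dimension d
M_h = random_matrix(rng, n, d)  # n hypothesis columns
params = {name: random_matrix(rng, d, d)
          for name in ["W2", "Wz", "Uz", "Wr", "Ur", "Wh", "Uh"]}
params["w1"] = random_matrix(rng, 1, d)[0]
params["W3T"] = random_matrix(rng, num_classes, 4 * d)
probs = san_predict(M_p, M_h, params, K)
```

Because each per-step output is a probability distribution, their average `probs` is also a valid distribution over the relation labels.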

The Training Procedure

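Training MT-DNN has two stages: pre-training and multi-task learning. Pre-training follows the BERT implementation, and the multi-task learning procedure is shown in the figure below: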

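For classification tasks, the loss is cross-entropy, $-\sum_{c} \mathbb{1}(X, c) \log \left(P_{r}(c | X)\right)$. For text similarity tasks, it is the mean squared error, $\left(y-\operatorname{sim}\left(X_{1}, X_{2}\right)\right)^{2}$. For ranking tasks, given a question Q and a candidate set A, the set A consists of one positive example $A^{+}$ and $|A|-1$ negative examples. The objective is to minimize the negative log-likelihood of the positive example: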
$$
-\sum_{\left(Q, A^{+}\right)} \log P_{r}\left(A^{+} | Q\right)\\
P_{r}\left(A^{+} | Q\right)=\frac{\exp \left(\gamma \operatorname{Rel}\left(Q, A^{+}\right)\right)}{\sum_{A^{\prime} \in A} \exp \left(\gamma \operatorname{Rel}\left(Q, A^{\prime}\right)\right)}
$$
The experiments set the scaling factor $\gamma=1$.
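The three objectives can be written out as small Python functions. This is a minimal sketch with hypothetical function names and made-up scores. The ranking loss takes the negative log of $P_{r}(A^{+} | Q)$, which is an assumption about the intended objective based on its description as a likelihood-based loss.

```python
import math

def cross_entropy(probs, gold):
    """Classification loss: negative log-probability of the gold class."""
    return -math.log(probs[gold])

def mse(y, sim):
    """Text similarity loss: squared error between target and predicted score."""
    return (y - sim) ** 2

def ranking_nll(rel_scores, positive_index, gamma=1.0):
    """Ranking loss: -log softmax(gamma * Rel) evaluated at the positive candidate."""
    scaled = [gamma * r for r in rel_scores]
    peak = max(scaled)  # log-sum-exp with max subtraction for stability
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in scaled))
    return -(scaled[positive_index] - log_norm)

# Toy values, chosen only for illustration
loss_cls = cross_entropy([0.7, 0.2, 0.1], gold=0)
loss_sim = mse(y=3.0, sim=2.5)
loss_rank = ranking_nll([2.0, 0.5, -1.0], positive_index=0)
```

With `gamma=1.0` the ranking loss reduces to ordinary softmax cross-entropy over the candidate set, with the positive answer as the target class.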

Experiments

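$\mathbf{BERT_{LARGE}}$: the baseline, fine-tuned on the GLUE data
$\mathbf{MT-DNN_{no-fine-tuning}}$: MT-DNN without the fine-tuning step
ST-DNN: removes multi-task learning and fine-tunes on each task separately; it differs from BERT only in the design of the task-specific output layers.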

Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding：

Conclusion

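This paper combines language model pre-training with multi-task learning to learn text representations. It achieves state-of-the-art results on several public datasets and also generalizes well in domain adaptation experiments.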